Adriana de Vicente
Irma Sánchez
With preprocessing done, we still need to make sense of the data. In exploratory data analysis (EDA) we examine a variety of plots and let the data speak for itself, which gives us a deeper statistical understanding of the dataset.
Let's import the required packages and load the data.
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import seaborn as sns
import json
import collections
import re, string
import sys
import time
import heapq
from subprocess import check_output
from nltk.corpus import stopwords
from wordcloud import WordCloud
#from mpl_toolkits.basemap import Basemap
import networkx as nx
import geopandas as gpd
import folium
In this step, we open the file containing the review data and read it into a list. We then use this list to create a Pandas DataFrame.
data = []
with open('../data/raw/yelp_academic_dataset_review.json', encoding='utf8') as data_file:
    for line in data_file:
        data.append(json.loads(line))
review_df = pd.DataFrame(data)
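An equivalent way to load JSON-lines files is pandas' built-in reader; a minimal sketch on toy data (the in-memory buffer and the tiny `chunksize` stand in for the real multi-gigabyte review file):

```python
import io
import pandas as pd

# Toy stand-in for the review file: one JSON object per line (JSON-lines).
json_lines = io.StringIO(
    '{"review_id": "r1", "stars": 5, "text": "Great food"}\n'
    '{"review_id": "r2", "stars": 2, "text": "Too slow"}\n'
)

# lines=True tells pandas each line is a separate JSON record;
# chunksize streams the file in pieces, keeping memory bounded.
chunks = pd.read_json(json_lines, lines=True, chunksize=1)
review_df = pd.concat(chunks, ignore_index=True)
print(review_df.shape)  # (2, 3)
```

For the real file, a larger chunk size (e.g. 100000) avoids the overhead of many tiny concatenations.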
In this step, we read the business, check-in, and user data directly from their JSON files into Pandas DataFrames.
business_df = pd.read_json('../data/raw/yelp_academic_dataset_business.json',lines=True)
checkin_df = pd.read_json('../data/raw/yelp_academic_dataset_checkin.json',lines=True)
user_df = pd.read_json('../data/raw/yelp_academic_dataset_user.json',lines=True)
In this step, we save the business, review, check-in, and user DataFrames to separate CSV files in the processed-data directory.
business_df.to_csv('../data/processed/business_df.csv', index=False)
review_df.to_csv('../data/processed/review_df.csv', index=False)
checkin_df.to_csv('../data/processed/checkin_df.csv', index=False)
user_df.to_csv('../data/processed/user_df.csv', index=False)
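One caveat worth noting, sketched here on toy data rather than the real files: CSV is a plain-text format, so dict-valued columns such as `attributes` and `hours` come back as strings after a round-trip and need re-parsing (e.g. with `ast.literal_eval`):

```python
import io
import pandas as pd

# Toy frame mimicking business_df: 'attributes' holds Python dicts.
df = pd.DataFrame({
    "business_id": ["b1"],
    "attributes": [{"ByAppointmentOnly": "True"}],
})

# Round-trip through CSV: dicts are serialised as their string repr.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

print(type(df.loc[0, "attributes"]))        # dict
print(type(reloaded.loc[0, "attributes"]))  # str
```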
In this step, we read the business and review data back from the CSV files into Pandas DataFrames.
business_df = pd.read_csv('../data/processed/business_df.csv')
review_df = pd.read_csv('../data/processed/review_df.csv')
In this step, we are checking the shape of the business dataframe and printing the column names and the first few rows of the dataframe.
print(business_df.shape)
for col in business_df.columns:
    print(col)
business_df.head()
(150346, 14)
business_id
name
address
city
state
postal_code
latitude
longitude
stars
review_count
is_open
attributes
categories
hours
| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pns2l4eNsfO8kk83dixA6A | Abby Rappoport, LAC, CMQ | 1616 Chapala St, Ste 2 | Santa Barbara | CA | 93101 | 34.426679 | -119.711197 | 5.0 | 7 | 0 | {'ByAppointmentOnly': 'True'} | Doctors, Traditional Chinese Medicine, Naturop... | NaN |
| 1 | mpf3x-BjTdTEA3yCZrAYPw | The UPS Store | 87 Grasso Plaza Shopping Center | Affton | MO | 63123 | 38.551126 | -90.335695 | 3.0 | 15 | 1 | {'BusinessAcceptsCreditCards': 'True'} | Shipping Centers, Local Services, Notaries, Ma... | {'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ... |
| 2 | tUFrWirKiKi_TAnsVWINQQ | Target | 5255 E Broadway Blvd | Tucson | AZ | 85711 | 32.223236 | -110.880452 | 3.5 | 22 | 0 | {'BikeParking': 'True', 'BusinessAcceptsCredit... | Department Stores, Shopping, Fashion, Home & G... | {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ... |
| 3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 4.0 | 80 | 1 | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ... |
| 4 | mWMc6_wTdE0EUBKIGXDVfA | Perkiomen Valley Brewery | 101 Walnut St | Green Lane | PA | 18054 | 40.338183 | -75.471659 | 4.5 | 13 | 1 | {'BusinessAcceptsCreditCards': 'True', 'Wheelc... | Brewpubs, Breweries, Food | {'Wednesday': '14:0-22:0', 'Thursday': '16:0-2... |
The first line assigns to "restaurant" the subset of "business_df" where the "categories" column is not null (NaN). The second line narrows "restaurant" further to the rows whose "categories" string contains "Restaurants".
restaurant = business_df[business_df["categories"].notna()]
restaurant = restaurant[restaurant["categories"].str.contains("Restaurants")]
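On toy data, the same two-step filter looks like the sketch below. Note that `str.contains` is case-sensitive by default; passing `case=False` would also match variants such as "restaurants":

```python
import pandas as pd

# Toy stand-in for business_df.
biz = pd.DataFrame({
    "name": ["St Honore Pastries", "The UPS Store", "Target", "No Category"],
    "categories": ["Restaurants, Food", "Shipping Centers",
                   "Department Stores", None],
})

# Step 1: drop rows with no category string at all.
with_cats = biz[biz["categories"].notna()]

# Step 2: keep rows whose category list mentions "Restaurants".
restaurants = with_cats[with_cats["categories"].str.contains("Restaurants")]
print(list(restaurants["name"]))  # ['St Honore Pastries']
```

Dropping the nulls first matters: calling `str.contains` on a column with missing values returns NaN for those rows, which cannot be used directly as a boolean mask.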
This code counts the number of restaurants per city, keeps the twenty most frequent cities, and plots the counts as a bar chart.
x = restaurant["city"].value_counts()
x = x.iloc[:20]
plt.figure(figsize=(16,4))
ax = sns.barplot(x=x.index, y=x.values)
ax.set_xticklabels(x.index, rotation=60, ha='right')
plt.title("Cities with most restaurants reviewed by Yelp")
plt.show()
The chart shows that the city with the most reviewed restaurants is Philadelphia, by a considerable margin over the other cities.
This code is creating a new DataFrame called "ratings" and assigning to it a subset of the "restaurant" DataFrame. The subset includes only the columns specified in the list within the square brackets (i.e., "name", "city", "latitude", "longitude", "stars", and "review_count").
ratings = restaurant[["name","city","latitude","longitude","stars","review_count"]]
ratings
| | name | city | latitude | longitude | stars | review_count |
|---|---|---|---|---|---|---|
| 3 | St Honore Pastries | Philadelphia | 39.955505 | -75.155564 | 4.0 | 80 |
| 5 | Sonic Drive-In | Ashland City | 36.269593 | -87.058943 | 2.0 | 6 |
| 8 | Tsevi's Pub And Grill | Affton | 38.565165 | -90.321087 | 3.0 | 19 |
| 9 | Sonic Drive-In | Nashville | 36.208102 | -86.768170 | 1.5 | 10 |
| 11 | Vietnamese Food Truck | Tampa Bay | 27.955269 | -82.456320 | 4.0 | 10 |
| ... | ... | ... | ... | ... | ... | ... |
| 150325 | Wawa | Clifton Heights | 39.925656 | -75.310344 | 3.0 | 11 |
| 150327 | Dutch Bros Coffee | Boise | 43.615401 | -116.284689 | 4.0 | 33 |
| 150336 | Adelita Taqueria & Restaurant | Philadelphia | 39.935982 | -75.158665 | 4.5 | 35 |
| 150339 | The Plum Pit | Aston | 39.856185 | -75.427725 | 4.5 | 14 |
| 150340 | West Side Kebab House | Edmonton | 53.509649 | -113.675999 | 4.5 | 18 |
52268 rows × 6 columns
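A frame like "ratings" lends itself to simple per-city aggregation; a sketch on toy data (the column names match the notebook, the values are made up):

```python
import pandas as pd

# Toy stand-in for the ratings frame.
ratings = pd.DataFrame({
    "city":  ["Philadelphia", "Philadelphia", "Nashville"],
    "stars": [4.0, 4.5, 1.5],
    "review_count": [80, 35, 10],
})

# Average star rating and total review volume per city.
by_city = ratings.groupby("city").agg(
    mean_stars=("stars", "mean"),
    total_reviews=("review_count", "sum"),
)
print(by_city.loc["Philadelphia", "mean_stars"])  # 4.25
```

A plain mean weights a 6-review restaurant the same as an 80-review one; weighting by `review_count` would be a reasonable refinement.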
This code selects the restaurants that fall inside a bounding box around central Philadelphia and plots their locations.
fig, ax = plt.subplots(figsize=(10, 10))
phi_lat, phi_lon = 39.9526, -75.1652
lon_min, lon_max = phi_lon - .2, phi_lon + .3
lat_min, lat_max = phi_lat - .1, phi_lat + .2
phi_mask = (
    (restaurant['longitude'] > lon_min) &
    (restaurant['longitude'] < lon_max) &
    (restaurant['latitude'] > lat_min) &
    (restaurant['latitude'] < lat_max)
)
phi_restaurant = restaurant[phi_mask]
ax.scatter(
x=phi_restaurant['longitude'],
y=phi_restaurant['latitude'],
c='yellow',
s=0.2
)
ax.set_title("Restaurant Locations in Philadelphia")
ax.set_facecolor('black')
plt.show()
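The bounding-box selection can be factored into a small reusable helper; `bbox_mask` is a hypothetical name introduced here for illustration, and the half-widths are symmetric for simplicity (the notebook uses asymmetric offsets):

```python
import pandas as pd

def bbox_mask(df, lat, lon, dlat=0.15, dlon=0.25):
    """Boolean mask for rows within a lat/lon box centred on (lat, lon).

    dlat and dlon are half-widths in degrees.
    """
    return (
        df["latitude"].between(lat - dlat, lat + dlat)
        & df["longitude"].between(lon - dlon, lon + dlon)
    )

# Toy frame: one point inside the Philadelphia box, one far away.
pts = pd.DataFrame({
    "latitude":  [39.9556, 36.2081],
    "longitude": [-75.1556, -86.7682],
})
mask = bbox_mask(pts, 39.9526, -75.1652)
print(mask.tolist())  # [True, False]
```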
The plot shows that restaurants in Philadelphia are concentrated in the city centre rather than in the suburbs.
This code extracts the latitude, longitude, and name columns from "phi_restaurant", and bins the star ratings into four colour categories for plotting. We first take an explicit copy so that assigning the new "marker_colour" column does not trigger pandas' SettingWithCopyWarning.
phi_restaurant = phi_restaurant.copy()
rest_lat = phi_restaurant['latitude']
rest_lon = phi_restaurant['longitude']
rest_name = phi_restaurant['name']
phi_restaurant['marker_colour'] = pd.cut(phi_restaurant['stars'], bins=4,
                                         labels=['red','yellow','green','blue'])
rest_colour = phi_restaurant['marker_colour']
Finally, we build an interactive folium map centred on Philadelphia, adding one CircleMarker per restaurant, coloured by its star-rating bin.
m = folium.Map(location=[phi_lat, phi_lon])
feature_group = folium.FeatureGroup("Locations")
for lat, lng, name, color in zip(rest_lat, rest_lon, rest_name, rest_colour):
    feature_group.add_child(folium.CircleMarker(location=[lat, lng], popup=name,
                                                color=color, radius=3))
m.add_child(feature_group)
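The `pd.cut` colour binning used for the markers is easy to check in isolation. With ratings spanning the full 1.0–5.0 range, `bins=4` produces four equal-width intervals of one star each:

```python
import pandas as pd

# Toy star ratings spanning the full 1.0-5.0 range.
stars = pd.Series([1.0, 2.5, 3.5, 5.0])

# Four equal-width bins over the observed range, mapped to the same
# marker colours the notebook uses.
colours = pd.cut(stars, bins=4, labels=['red', 'yellow', 'green', 'blue'])
print(colours.tolist())  # ['red', 'yellow', 'green', 'blue']
```

Because `pd.cut` bins over the *observed* range, a city whose restaurants all rate between 3.0 and 5.0 would get different bin edges; passing explicit edges (e.g. `bins=[1, 2, 3, 4, 5]`) would make the colour scale comparable across subsets.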